Computer and Modernization ›› 2010, Vol. 1 ›› Issue (6): 137-139. doi: 10.3969/j.issn.1006-2475.2010.06.039

• Networks and Communication •

Improved Weight Calculation for Web Page Features Based on the Vector Space Model

李中原,杨守文   

  1. College of Information Science and Technology, Beijing University of Chemical Technology, Beijing 100029
  • Received: 2010-02-05 Revised: 1900-01-01 Online: 2010-07-01 Published: 2010-07-01

Improvement of Weight of Web Page Features in Calculation Based on VSM

LI Zhong-yuan, YANG Shou-wen   

  1. College of Computer Science and Technology, Beijing University of Chemical Technology, Beijing 100029, China
  • Received:2010-02-05 Revised:1900-01-01 Online:2010-07-01 Published:2010-07-01

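Abstract: The classical vector space model is used to classify Web page text. Because the traditional term-weighting formula, TFIDF, handles Web page keywords poorly and gives keywords low discrimination between classes, this paper divides the structure of a Web page into two parts: a keyword part, which contains the title, metadata, link anchor text, and similar elements, and the body of the page. The weights of the keyword part are strengthened, while the body part is weighted with an improved IDF, so that the ability of keywords to discriminate between classes improves to a certain extent. Experiments show that the method is feasible.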

Keywords: vector space model, feature representation, TFIDF

Abstract: This paper applies the classical vector space model (VSM) to Web page text classification. The traditional TFIDF weighting formula has shortcomings in this setting: it handles Web page keywords poorly, and it gives keywords little discriminating power between classes. The paper therefore divides a Web page into two parts: a keyword part, containing the title, metadata, and link anchor text, and a body part. The weights of terms in the keyword part are strengthened, while terms in the body part are weighted with an improved IDF, which improves to a certain extent how well keywords discriminate between classes. Experiments show that the method is feasible.

Key words: VSM, feature representation, TFIDF
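
The two-part weighting described in the abstract can be illustrated with a minimal sketch. It is not the authors' formula: the abstract does not give the improved IDF or the size of the keyword boost, so the code below assumes a standard smoothed IDF and a hypothetical multiplier `KEYWORD_BOOST`. The page layout (`"keywords"` and `"body"` token lists) and the function names are likewise illustrative assumptions.

```python
import math
from collections import Counter

# Hypothetical boost for terms found in the title, metadata, or anchor text.
# The abstract does not state the value the authors used.
KEYWORD_BOOST = 3.0


def idf(term, documents):
    """Standard smoothed IDF, used here as a stand-in for the improved IDF."""
    containing = sum(
        1 for doc in documents
        if term in doc["keywords"] or term in doc["body"]
    )
    return math.log((1 + len(documents)) / (1 + containing)) + 1.0


def page_vector(page, documents, boost=KEYWORD_BOOST):
    """Return an L2-normalized term -> weight mapping for one page."""
    keyword_tf = Counter(page["keywords"])  # title, metadata, anchor text
    body_tf = Counter(page["body"])         # main text of the page
    weights = {}
    for term in set(keyword_tf) | set(body_tf):
        # Occurrences in the keyword part count `boost` times as much.
        tf = boost * keyword_tf[term] + body_tf[term]
        weights[term] = tf * idf(term, documents)
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else weights


# Toy corpus: each page is already tokenized into its two parts.
pages = [
    {"keywords": ["football", "league"],
     "body": ["match", "football", "goal", "team"]},
    {"keywords": ["stock", "market"],
     "body": ["price", "market", "team", "share"]},
]
vec = page_vector(pages[0], pages)
```

In this toy corpus, "football" appears once in the keyword part and once in the body of the first page, so its term frequency is 3 + 1 = 4 before IDF scaling, while a body-only term such as "match" keeps a term frequency of 1. A term shared by both pages, such as "team", receives a lower IDF and therefore a lower weight.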

CLC number: